Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a random seed specific to datagen cases #9441

Merged
merged 17 commits into from
Nov 15, 2023

Conversation

abellina
Copy link
Collaborator

@abellina abellina commented Oct 13, 2023

Closes #9282

Please note that there are a number of overrides in the integration tests, and I wasn't planning on changing these, but I can I need to look into why the are seeding from 0 each time.

src/main/python/collection_ops_test.py: step_gen.start(random.Random(0))
src/main/python/collection_ops_test.py: gen.start(random.Random(0))
src/main/python/json_fuzz_test.py:_name_gen.start(random.Random(0))

src/main/python/cache_test.py: def op_df(spark, length=2048, seed=0):
src/main/python/generate_expr_test.py:def four_op_df(spark, gen, length=2048, seed=0):
src/main/python/expand_exec_test.py: def op_df(spark, length=2048, seed=0):

I did not add the random seed to the test name, but I easily can. The reason I was leaning against this is that the user can easily set the seed, so the seed in the name of the test wouldn't match if they have changed something internally. Every test would use the same seed, and it's overwrite-able via environment variable or via the seed parameter in datagen.

The variable we can use to override this in manual runs is: SPARK_RAPIDS_TEST_DATAGEN_SEED

@abellina abellina requested a review from revans2 October 13, 2023 15:00
@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

I have updated the test names to have the seed in the test name, here's an example:

../../src/main/python/dpp_test.py::test_dpp_via_aggregate_subquery[false-0-parquet][DATAGEN_SEED=1697219724, INJECT_OOM, IGNORE_ORDER]

@abellina abellina added the test Only impacts tests label Oct 13, 2023
@abellina
Copy link
Collaborator Author

build

@revans2
Copy link
Collaborator

revans2 commented Oct 13, 2023

If I could have everything I wanted I would change the datagen so the seed is not passed into it at all. Then if someone wants to override the seed they use an annotation to force it to a specific seed.

revans2
revans2 previously approved these changes Oct 13, 2023
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally.

@@ -113,7 +113,6 @@ def is_parquet_testing_tests_forced():
_inject_oom = None

def should_inject_oom():
global _inject_oom
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the global not needed any more?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something I owed @gerashegalov from a long time ago #7925 (comment).

Since we don't need to change the variable, it's not needed.

@abellina
Copy link
Collaborator Author

abellina commented Oct 13, 2023

If I could have everything I wanted I would change the datagen so the seed is not passed into it at all. Then if someone wants to override the seed they use an annotation to force it to a specific seed.

If you had multiple datagens with the argument approach one can set a seed for each datagen invocation. With an annotation, I'd have to check if a function invocation can be annotated, though at that point I am not sure there's much point to that (though I could be wrong)

@abellina
Copy link
Collaborator Author

I am seeing a failure in CI:

FAILED ../../src/main/python/arithmetic_ops_test.py::test_decimal_bround[Float][DATAGEN_SEED=1697220957, INJECT_OOM, INCOMPAT, APPROXIMATE_FLOAT]

So I'll take a look at reproing this locally.

@abellina
Copy link
Collaborator Author

This is another manifestation of #9350. That said, that's the only test that failed in this CI run. If we did have that parameter annotation like @revans2 suggested, we could at least print in the test that we meant to override the seed... I'll check on that. But worse case I may skip this test since we have a ticket for it.

@abellina
Copy link
Collaborator Author

BTW, the test failure is easy to repro via:

SPARK_RAPIDS_TEST_DATAGEN_SEED=1697220957 ./run_pyspark_from_build.sh -k test_decimal_bround

@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

@firestarman mind taking a look at this diff 7fc82d6, I am not sure if the random with seed=0 was set on purpose.

@abellina
Copy link
Collaborator Author

build

@firestarman
Copy link
Collaborator

@firestarman mind taking a look at this diff 7fc82d6, I am not sure if the random with seed=0 was set on purpose.

I probably just copied it from the old code and it is ok to remove it if no failures are found.

@abellina
Copy link
Collaborator Author

build

1 similar comment
@abellina
Copy link
Collaborator Author

abellina commented Nov 2, 2023

build

@revans2
Copy link
Collaborator

revans2 commented Nov 3, 2023

Looks like there are still some more tests that need to be fixed/xfailed + issues filed before we can check this in.

@abellina
Copy link
Collaborator Author

abellina commented Nov 3, 2023

Looks like there are still some more tests that need to be fixed/xfailed + issues filed before we can check this in.

Yes, there's a list of issues that are failing in CI, some I can repro locally, some I can't. I should be able to get back to this next week.

@abellina
Copy link
Collaborator Author

abellina commented Nov 13, 2023

Follow on issue: #9703

@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

build

1 similar comment
@abellina
Copy link
Collaborator Author

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good. I just want to be sure that there is a follow on issue for every test that has a hard coded seed, or a good explanation as to why it is hard coded.

@@ -91,11 +91,12 @@ def do_join(spark):
@pytest.mark.parametrize('data_gen', all_gen, ids=idfn)
@pytest.mark.parametrize('enable_vectorized_conf', enable_vectorized_confs, ids=idfn)
@ignore_order
@datagen_overrides(seed=0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no reason here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh whoops, let me add that.. not intended

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am re-running CI with these removed. I can't repro these locally.


#sort locally because of https://github.com/NVIDIA/spark-rapids/issues/84
# After 3.1.0 is the min spark version we can drop this
@ignore_order(local=True)
@datagen_overrides(seed=0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the reason for these?

data_gen.start(rand)
data = [data_gen.gen() for index in range(0, length)]
return data

def gen_df(spark, data_gen, length=2048, seed=0, num_slices=None):
def gen_df(spark, data_gen, length=2048, seed=None, num_slices=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want a follow on issue to remove seed for gen_df and force us to go through the annotation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abellina
Copy link
Collaborator Author

build

1 similar comment
@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

build

@abellina
Copy link
Collaborator Author

build

@abellina abellina merged commit e4fdd84 into NVIDIA:branch-23.12 Nov 15, 2023
36 checks passed
@abellina abellina deleted the add_test_seed_for_datagen branch November 15, 2023 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Look at changing the seed for integration tests each time
3 participants